On the widespread and critical impact of systematic bias and batch effects in single-cell RNA-Seq data
نویسندگان
چکیده
Single-cell RNA-Sequencing (scRNA-Seq) has become the most widely used high-throughput method for transcription profiling of individual cells. Systematic errors, including batch effects, have been widely reported as a major challenge in high-throughput technologies. Surprisingly, these issues have received minimal attention in published studies based on scRNA-Seq technology. We examined data from five published studies and found that systematic errors can explain a substantial percentage of observed cell-to-cell expression variability. Specifically, we found that the proportion of genes reported as expressed explains a substantial part of observed variability and that this quantity varies systematically across experimental batches. Furthermore, we found that the implemented experimental designs confounded outcomes of interest with batch effects, a design that can bring into question some of the conclusions of these studies. Finally, we propose a simple experimental design that can ameliorate the effect of theses systematic errors have on downstream results. . CC-BY 4.0 International license peer-reviewed) is the author/funder. It is made available under a The copyright holder for this preprint (which was not . http://dx.doi.org/10.1101/025528 doi: bioRxiv preprint first posted online Aug. 25, 2015; Single-cell RNA-Sequencing (scRNA-Seq) has become the primary tool for profiling the transcriptomes of hundreds or even thousands of individual cells in parallel. Our experience with high-throughput genomic data in general, is that well thought-out data processing pipelines are essential to produce meaningful downstream results. We expect the same to be true for scRNA-seq data. Here we show that while some tools developed for analyzing bulk RNA-Seq can be used for scRNA-Seq data, such as the mapping and alignment software, other steps in the processing, such as normalization, quality control and quantification, require new methods to account for the additional variability that is specific to this technology. One of the most challenging sources of unwanted variability and systematic error in highthroughput data are what are commonly referred to as batch effects. Given the way that scRNASeq experiments are conducted, there is much room for concern regarding batch effects. Specifically, batch effects occur when cells from one biological group or condition are cultured, captured and sequenced separate from cells in a second condition. Although batch information is not always included in the experimental annotations that are publicly available, one can extract surrogate variables from the raw sequencing (FASTQ) files. Namely, the sequencing instrument used, the run number from the instrument and the flow cell lane. Although the sequencing is unlikely to be a major source of unwanted variability, it serves as a surrogate for other experimental procedures that very likely do have an effect, such as starting material, PCR amplification reagents/conditions, and cell cycle stage of the cells. Here we will refer to the resulting differences induced by different groupings of these sources of variability as batch effects. In a completely confounded study, it is not possible to determine if the biological condition or batch effects are driving the observed variation. In contrast, incorporating biological replicates across in the experimental design and processing the replicates across multiple batches permits observed variation to be attributed to biology or batch effects (Figure 1). To demonstrate the widespread problem of systematic bias, batch effects, and confounded experimental designs in scRNA-Seq studies, we surveyed several published data sets. We discuss the consequences of failing to consider the presence of this unwanted technical variability, and consider new strategies to minimize its impact on scRNA-Seq data. . CC-BY 4.0 International license peer-reviewed) is the author/funder. It is made available under a The copyright holder for this preprint (which was not . http://dx.doi.org/10.1101/025528 doi: bioRxiv preprint first posted online Aug. 25, 2015;
منابع مشابه
A Graph-Based Clustering Approach to Identify Cell Populations in Single-Cell RNA Sequencing Data
Introduction: The emergence of single-cell RNA-sequencing (scRNA-seq) technology has provided new information about the structure of cells, and provided data with very high resolution of the expression of different genes for each cell at a single time. One of the main uses of scRNA-seq is data clustering based on expressed genes, which sometimes leads to the detection of rare cell populations. ...
متن کاملA Graph-Based Clustering Approach to Identify Cell Populations in Single-Cell RNA Sequencing Data
Introduction: The emergence of single-cell RNA-sequencing (scRNA-seq) technology has provided new information about the structure of cells, and provided data with very high resolution of the expression of different genes for each cell at a single time. One of the main uses of scRNA-seq is data clustering based on expressed genes, which sometimes leads to the detection of rare cell populations. ...
متن کاملI-13: Transcriptome Dynamics of Human and Mouse Preimplantation Embryos Revealed by Single Cell RNA-Sequencing
Background: Mammalian preimplantation development is a complex process involving dramatic changes in the transcriptional architecture. However, it is still unclear about the crucial transcriptional network and key hub genes that regulate the proceeding of preimplantation embryos. Materials and Methods: Through single-cell RNAsequencing (RNA-seq) of both human and mouse preimplantation embryos, ...
متن کاملI-42: Origins and Differentiation of Somatic Progenitors of The Mammalian Gonad Revealed by Single Cell RNA-Seq
Background - MaterialsAndMethods N;Results N;Conclusion N;
متن کاملAccounting for technical noise in differential expression analysis of single-cell RNA sequencing data
Recent technological breakthroughs have made it possible to measure RNA expression at the single-cell level, thus paving the way for exploring expression heterogeneity among individual cells. Current single-cell RNA sequencing (scRNA-seq) protocols are complex and introduce technical biases that vary across cells, which can bias downstream analysis without proper adjustment. To account for cell...
متن کامل